Abstract

We investigate the representation of hierarchical models in terms of marginals of
other hierarchical models with smaller interactions. We focus on binary variables and
marginals of pairwise interaction models whose hidden variables are conditionally
independent given the visible variables. In this case the problem is equivalent to the
representation of linear subspaces of polynomials by feedforward neural networks with
soft-plus computational units. We show that every hidden variable can freely model
multiple interactions among the visible variables, which allows us to generalize and
improve previous results. In particular, we show that a restricted Boltzmann machine
with fewer than $[2(\log(v) + 1)/(v + 1)]\,2^v - 1$ hidden binary variables can approximate
every distribution of $v$ visible binary variables arbitrarily well, compared to $2^{v-1} - 1$
from the best previously known result.

Keywords: hierarchical model, restricted Boltzmann machine, interaction model,
connectionism, graphical model


1 Introduction 

Consider a finite set $V$ of random variables. A hierarchical log-linear model is a set of
joint probability distributions that can be written as products of interaction potentials, as
$p(x) = \prod_{\Lambda} \psi_\Lambda(x)$, where $\psi_\Lambda(x) = \psi_\Lambda(x_\Lambda)$ only depends on the subset $\Lambda$ of variables and
where the product runs over a fixed family of such subsets. By introducing hidden variables,
it is possible to express the same probability distributions in terms of potentials which involve
only small sets of variables, as $p(x) = \sum_y \prod_{\lambda} \psi_\lambda(x, y)$, with small sets $\lambda$. Using small
interactions is a central idea in the context of connectionistic models, where the sets $\lambda$ are
often restricted to have cardinality two. Due to the simplicity of their local characteristics,
these models are particularly well suited for Gibbs sampling. The representation, or
explanation, of complex interactions among observed variables in terms of hidden variables
is also related to the study of common ancestors [36].

We are interested in sufficient and necessary conditions on the number of hidden
variables, their values, and the interaction structures, under which the visible marginals are
flexible enough to represent any distribution from a given hierarchical model. Many
problems can be formulated as special cases of this general problem. One example is the problem
of calculating the smallest number of layers of variables that a deep Boltzmann machine
needs in order to represent any probability distribution.

In this article, we focus on the case that all variables are binary. For the hierarchical
models with hidden variables, we restrict our attention to models involving only pairwise
interactions and whose hidden variables are conditionally independent given the visible
variables (no direct interactions between the hidden variables). A prominent example of this
type of model is the restricted Boltzmann machine, which has full bipartite interactions
between the visible and hidden variables. The representational power of restricted Boltzmann
machines has been studied assiduously. The free energy function of
such a model is a sum of soft-plus computational units $x \mapsto \log(1 + \exp(\sum_{i \in V} w_i x_i + c))$.
On the other hand, the energy function of a fully observable hierarchical model with binary
variables is a polynomial, with monomials corresponding to pure interactions. Since any
function of binary variables can be expressed as a polynomial, the task is then to
characterize the polynomials computable by soft-plus units.

Younes [17] showed that a hierarchical model with $N$ binary variables and a total of
$M$ pure higher order interactions (among three or more variables) can be represented as
the visible marginal of a pairwise interaction model with $M$ hidden binary variables. In
Younes' construction, each pure interaction is modeled by one hidden binary variable that
interacts pairwise with each of the involved visible variables. In fact, he shows that this
replacement can be accomplished without increasing the number of model parameters, by
imposing linear constraints on the coupling strengths of the hidden variable. In this work
we investigate ways of squeezing more degrees of freedom out of each hidden variable. An
indication that this should be possible is the fact that the full interaction model, for which
$M = 2^N - \binom{N}{2} - N - 1$, can be modeled by a pairwise interaction model with $2^{N-1} - 1$
hidden variables [10]. Indeed, by controlling groups of polynomial coefficients at a time,
we show that in general fewer than $M$ hidden variables are sufficient.

A special case of hierarchical models with hidden variables are mixtures of hierarchical 
models. The smallest mixtures of hierarchical models that contain other hierarchical models 
have been studied in [8]. The approach followed there is different and complementary to 
our analysis of soft-plus polynomials. For the necessary conditions, the idea there is to 
compare the possible support sets of the limit distributions of both models. For the sufficient 
conditions, the idea is to find a small S-set covering of the set of elementary events. An S-set
of a probability model is a set of elementary events such that every distribution supported 
in that set is a limit distribution from the model. 

Another type of hierarchical model with hidden variables is the tree model. The geometry
of binary tree models has been studied in [18] in terms of moments and cumulants. That
analysis bears some relation to ours in that it also elaborates on Möbius inversions.

This paper is organized as follows. Section 2 introduces hierarchical models and
formalizes our problem in the light of previous results. Section 3 pursues a characterization of the
polynomials that can be represented by soft-plus units. Section 4 applies this
characterization to study the representation of hierarchical models in terms of pairwise interaction
models with hidden variables. This section addresses principally restricted Boltzmann
machines. Section 5 offers our conclusions and outlook.
2 Preliminaries


This section introduces hierarchical models, with and without hidden variables, formalizes 
the problem that we address in this paper, and presents motivating prior results. 


2.1 Hierarchical Models 


Consider a finite set $V$ of variables with finitely many joint states $x = (x_i)_{i \in V} \in X =
\times_{i \in V} X_i$. We write $v = |V|$ for the cardinality of $V$. For a given set $S \subseteq 2^V$ of subsets of
$V$ let

$$\mathcal{V}_{X,S} := \Big\{ g(x) = \sum_{\Lambda \in S} g_\Lambda(x) : g_\Lambda(x) = g_\Lambda(x_\Lambda) \Big\}.$$

This is the linear subspace of $\mathbb{R}^X$ spanned by functions $g_\Lambda$ that only depend on sets of
variables $\Lambda \in S$. The hierarchical model of probability distributions on $X$ with interactions
$S$ is the set

$$\mathcal{E}_{X,S} := \Big\{ p(x) = \frac{1}{Z(g)} \exp(g(x)) : g \in \mathcal{V}_{X,S} \Big\}, \tag{1}$$

where $Z(g) = \sum_{x' \in X} \exp(g(x'))$ is a normalizing factor. We call

$$E(x) = g(x) = \sum_{\Lambda \in S} g_\Lambda(x) \tag{2}$$

the energy function of the corresponding probability distribution.

For convenience, in all that follows we assume that $S$ is a simplicial complex, meaning
that $\Lambda \in S$ implies $B \in S$ for all $B \subseteq \Lambda$. Furthermore, we assume that the union of elements
of $S$ equals $V$. In the case of binary variables, $X_i = \{0,1\}$ for all $i \in V$, the energy can be
written as a polynomial, as

$$E(x) = \sum_{\Lambda \in S} J_\Lambda \prod_{i \in \Lambda} x_i.$$

Here, $J_\Lambda \in \mathbb{R}$, $\Lambda \in S$, are the interaction weights that parametrize the model.
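The polynomial form of the energy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `simplicial_closure` and `energy`, the index set {1, 2, 3}, and the numerical weights are hypothetical and do not come from the paper.

```python
from itertools import combinations

def simplicial_closure(facets):
    """Return all subsets (including the empty set) of the given facets."""
    closure = set()
    for facet in facets:
        for r in range(len(facet) + 1):
            closure.update(frozenset(c) for c in combinations(sorted(facet), r))
    return closure

def energy(x, weights):
    """E(x) = sum over sets A of J_A * prod_{i in A} x_i, for binary x (dict i -> 0/1)."""
    return sum(J * all(x[i] for i in A) for A, J in weights.items())

# Hypothetical example: three variables with pairwise interactions {1,2} and {1,3}.
S = simplicial_closure([{1, 2}, {1, 3}])
weights = {A: 0.0 for A in S}
weights[frozenset({1, 2})] = 1.5
weights[frozenset({1, 3})] = -0.5
weights[frozenset({2})] = 0.25

value = energy({1: 1, 2: 1, 3: 0}, weights)
print(len(S), value)
```

For the input x = (1, 1, 0), only the monomials x_2 and x_1 x_2 are active among the nonzero weights, so the energy is 0.25 + 1.5.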


2.2 Hierarchical Models with Hidden Variables 

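Consider an additional set $H$ of variables with finitely many joint states $y = (y_j)_{j \in H} \in Y =
\times_{j \in H} Y_j$. We write $h = |H|$ for the cardinality of $H$. For a simplicial complex $T \subseteq 2^{V \cup H}$,
let $\mathcal{V}_{X \times Y, T} \subseteq \mathbb{R}^{X \times Y}$ be the linear subspace of functions of the form $g(x,y) = \sum_{\lambda \in T} g_\lambda(x, y)$,
$g_\lambda(x,y) = g_\lambda((x,y)_\lambda)$. The marginal on $X$ of the hierarchical model $\mathcal{E}_{X \times Y, T}$ is the set

$$\mathcal{M}_{X \times Y, T} := \Big\{ p(x) = \frac{1}{Z(g)} \sum_{y \in Y} \exp(g(x, y)) : g \in \mathcal{V}_{X \times Y, T} \Big\}, \tag{3}$$

where $Z(g) = \sum_{x' \in X, y' \in Y} \exp(g(x', y'))$ is again a normalizing factor. The free energy of a
probability distribution from $\mathcal{M}_{X \times Y, T}$ is given by

$$F(x) = \log \sum_{y \in Y} \exp\Big( \sum_{\lambda \in T} g_\lambda(x, y) \Big). \tag{4}$$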
Here and throughout "log" denotes the natural logarithm.

In the case of binary visible variables, $X_i = \{0,1\}$ for all $i \in V$, the free energy (4) can
be written as a polynomial, as

$$F(x) = \sum_{B \subseteq V} K_B \prod_{i \in B} x_i,$$

where the coefficients can be computed from Möbius' inversion formula as

$$K_B = \sum_{C \subseteq B} (-1)^{|B \setminus C|} \log \sum_{y \in Y} \exp\Big( \sum_{\lambda \in T} g_\lambda((1_C, 0_{V \setminus C}), y) \Big), \quad B \in 2^V. \tag{5}$$

Here $(1_C, 0_{V \setminus C}) \in \{0,1\}^V$ is the vector with value 1 in the entries $i \in C$ and value 0 in the
entries $i \notin C$.
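As a sanity check of the Möbius inversion formula, the following sketch computes the coefficients $K_B$ of an arbitrary function on $\{0,1\}^V$ and verifies that the resulting polynomial reproduces the function. The function `F` used here is an arbitrary test function chosen for illustration, and the helper names are assumptions, not notation from the paper.

```python
import math
from itertools import chain, combinations, product

def subsets(B):
    """All subsets of B, as sorted tuples."""
    B = sorted(B)
    return chain.from_iterable(combinations(B, r) for r in range(len(B) + 1))

def indicator(C, V):
    """The vector with value 1 on C and 0 elsewhere, ordered like V."""
    return tuple(1 if i in C else 0 for i in V)

def moebius_coefficients(F, V):
    """K_B = sum over C subset of B of (-1)^(|B| - |C|) * F(indicator of C)."""
    return {
        frozenset(B): sum((-1) ** (len(B) - len(C)) * F(indicator(C, V))
                          for C in subsets(B))
        for B in subsets(V)
    }

def evaluate(K, x, V):
    """Evaluate the polynomial sum_B K_B * prod_{i in B} x_i at a binary x."""
    return sum(k for B, k in K.items() if all(x[V.index(i)] for i in B))

V = [0, 1, 2]
# Arbitrary test function (illustrative values only).
F = lambda x: math.log(1.0 + math.exp(0.3 * x[0] - 1.2 * x[1] + 2.0 * x[2] + 0.5))
K = moebius_coefficients(F, V)
max_error = max(abs(evaluate(K, x, V) - F(x)) for x in product([0, 1], repeat=len(V)))
print(max_error)
```

The reconstruction error should be at the level of floating-point rounding, which reflects that every function of binary variables is a polynomial with coefficients given by the alternating sums above.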

If there are no direct interactions between hidden variables, i.e., if $|\lambda \cap H| \le 1$ for all
$\lambda \in T$, then the sum over $y$ factorizes and the free energy (4) can be written as

$$F(x) = \sum_{\lambda : \lambda \cap H = \emptyset} g_\lambda(x) + \sum_{j \in H} \log \sum_{y_j \in Y_j} \exp\Big( \sum_{\lambda \in T : j \in \lambda} g_\lambda(x, y_j) \Big). \tag{6}$$

Particularly interesting are the models with full bipartite interactions between the set of
visible variables and the set of hidden variables, i.e., models with $T = \{\lambda \subseteq V \cup H : |\lambda \cap V| \le
1, |\lambda \cap H| \le 1\}$, called restricted Boltzmann machines (with discrete variables).


2.3 Problem and Previous Results 

In general the marginal of a hierarchical model is not a hierarchical model. However, one 
may ask which hierarchical models are contained in the marginal of another hierarchical 
model. 

To represent a hierarchical model in terms of the marginal of another hierarchical model,
we need to represent (1) in terms of (3). Equivalently, we need to represent all possible
energy functions in terms of free energies. Given a set of visible variables $V$ and a simplicial
complex $S \subseteq 2^V$, what conditions on the set of hidden variables $H$ and the simplicial
complex $T \subseteq 2^{V \cup H}$ are sufficient and necessary in order for any function $E$ of the form (2)
to be representable in terms of some function $F$ of the form (4)? We would like to arrive at
a result that generalizes the following.


• A restricted Boltzmann machine with $h$ hidden binary variables can approximate any
probability distribution from a binary hierarchical model $\mathcal{E}_S$ with $|\{\Lambda \in S : |\Lambda| > 1\}| \le
h$ arbitrarily well [17].

• The restricted Boltzmann machine with $h = 2^{v-1} - 1$ hidden binary variables can
approximate any probability distribution of $v$ binary variables arbitrarily well [10].


Our Theorem 11 in Section 4 improves and generalizes these statements. This result is
based on soft-plus polynomials, which we discuss in the following section.
Figure 1: Illustration of a soft-plus computational unit. The possible inputs $X = \{0,1\}^V$,
corresponding to the vertices of the unit $V$-cube, are mapped to the real line by an affine
map $x \mapsto w^\top x + c$, and then the soft-plus non-linearity $f\colon s \mapsto \log(1 + \exp(s))$ is applied.

3 Soft-plus Polynomials 

Consider a function of the form

$$\phi\colon \{0,1\}^V \to \mathbb{R}; \quad x \mapsto \log(1 + \exp(w^\top x + c)), \tag{7}$$

parametrized by $w = (w_i)_{i \in V} \in \mathbb{R}^V$ and $c \in \mathbb{R}$. We regard $\phi$ as a soft-plus computational
unit, which integrates each input vector $x \in \{0,1\}^V$ into a scalar via $x \mapsto w^\top x + c$ and
applies the soft-plus non-linearity $f\colon \mathbb{R} \to \mathbb{R}_+$; $s \mapsto \log(1 + \exp(s))$. See Figure 1 for an
illustration of this function. In view of Equation (6), the function $\phi$ corresponds to the
free energy added by one hidden binary variable interacting pairwise with the visible binary
variables in $V$. The parameters $w_i$, $i \in V$, correspond to the pair interaction weights and $c$ to
the bias of the hidden variable.

What kinds of polynomials on $\{0,1\}^V$ can be represented by soft-plus units? Following
Equation (5), the polynomial coefficients of $\phi$ are given by

$$K_B(w, c) = \sum_{C \subseteq B} (-1)^{|B \setminus C|} \log\Big(1 + \exp\Big(\sum_{i \in C} w_i + c\Big)\Big), \quad B \in 2^V. \tag{8}$$

For each $B \in 2^V$ this is an alternating sum of the values $\phi(x)$ of the soft-plus unit on
the input vectors $x \in \{0,1\}^V$ with $\mathrm{supp}(x) \subseteq B$. In particular, $K_B$ is independent of the
parameters $w_i$, $i \notin B$. We will use the shorthand notation $w_B$ for $(w_i)_{i \in B}$.
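The coefficient formula (8) is straightforward to evaluate numerically. The following sketch, with hypothetical weights chosen only for illustration, computes $K_B(w, c)$ and checks two properties: $K_B$ does not change when a weight $w_i$ with $i \notin B$ is modified, and $K_B$ vanishes when $w_i = 0$ for some $i \in B$.

```python
import math
from itertools import combinations

def softplus(s):
    return math.log1p(math.exp(s))

def K(B, w, c):
    """Coefficient K_B(w, c) of Equation (8); w is a dict mapping i to w_i."""
    B = sorted(B)
    total = 0.0
    for r in range(len(B) + 1):
        for C in combinations(B, r):
            total += (-1) ** (len(B) - r) * softplus(sum(w[i] for i in C) + c)
    return total

w = {1: 0.7, 2: -1.3, 3: 2.1}   # illustrative weights, not from the paper
c = 0.4
k12 = K({1, 2}, w, c)
# Changing w_3 leaves K_{1,2} unchanged, since 3 is not in B = {1, 2}.
k12_other = K({1, 2}, {**w, 3: -5.0}, c)
# Setting w_2 = 0 makes the coefficient of every monomial containing x_2 vanish.
k12_zero = K({1, 2}, {**w, 2: 0.0}, c)
print(k12, k12_other, k12_zero)
```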

Note that, if $w_i = 0$ for some $i \in V$, then $K_C = 0$ for all $C \in 2^V$ with $i \in C$. In the
following we focus on the description of the possible values of the highest degree coefficients.
For example, Younes [17] showed that a soft-plus unit can represent a polynomial with an
arbitrary leading coefficient:

Lemma 1 (Lemma 1 in [17]). Let $B \subseteq V$ and $w_i = 0$ for $i \notin B$. Then, for any $J_B \in \mathbb{R}$,
there is a choice of $w_B \in \mathbb{R}^B$ and $c \in \mathbb{R}$ such that $K_B = J_B$.

The idea of Younes' proof of Lemma 1 is to choose all non-zero $w_i$ of equal magnitude.
This simplifies the calculations and reduces the number of free parameters to one. Our goal
is to show that a soft-plus unit can actually freely model several polynomial coefficients at
the same time. Our approach to simplify the Möbius inversion formula (8) is to choose the
parameters $w$ and $c$ in such a way that the function $\phi$ has many zeros. Clearly this can only
be done in an approximate way, since the soft-plus function is strictly positive. Nevertheless,
these approximations can be made arbitrarily accurate, since $\log(1 + \exp(s)) < \exp(s)$ is
arbitrarily close to zero for sufficiently large negative values of $s$.

Figure 2: Illustration of Lemma 2. Depicted is, for each edge pair $(B, B')$, the set of
coefficient pairs $(K_B, K_{B'}) \in \mathbb{R}^2$ of the polynomials $\sum_{C \subseteq V} K_C \prod_{i \in C} x_i$ expressible as
$\log(1 + \exp(w^\top x + c))$. Shown is also the set of monomials of partial degree one and degree
at most 4, partially ordered by variable inclusion.

We call a pair of sets $(B, B')$ an edge pair or a covering pair when $B \supsetneq B'$ and there
is no set $C$ with $B \supsetneq C \supsetneq B'$. The next lemma shows that a soft-plus unit can jointly
model the coefficients of an edge pair, at least in part. When the maximum degree $|B|$ is at
most 3, the two coefficients are restricted by an inequality, but when $|B| \ge 4$, there are no
such restrictions. The result is illustrated in Figure 2.

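Lemma 2. Consider an edge pair $(B, B')$. Depending on $|B|$, for any $\epsilon > 0$ there is a
choice of $w_B \in \mathbb{R}^B$ and $c \in \mathbb{R}$ such that $\|(K_B, K_{B'}) - (J_B, J_{B'})\| < \epsilon$ if and only if

$$\begin{cases}
J_{B'} \ge 0, -J_B & \text{for } |B| = 1, \\
J_{B'} \ge 0, -J_B \ \text{ or } \ J_{B'} \le 0, -J_B & \text{for } |B| = 2, \\
J_{B'} \ge 0, -J_B \ \text{ or } \ J_{B'} \le 0, -J_B & \text{for } |B| = 3, \\
(J_B, J_{B'}) \in \mathbb{R}^2 & \text{for } |B| \ge 4.
\end{cases}$$

Proof. This proof is deferred to Appendix A. □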

Remark 3. If $(B, B')$ is an edge pair with $|B| = 3$, then, despite having $|B| + 1 = 4$
parameters to vary ($w_i$, $i \in B$, and $c$), we can only determine the polynomial coefficients
$K_B$ and $K_{B'}$ up to a certain inequality. We expect that the same is true in general: If we
want to freely control $k$ polynomial coefficients, we need strictly more than $k$ parameters.
Otherwise, the coefficients are restricted by some inequalities. This situation is common
in models with hidden variables. In particular, mixture models often require many more
parameters to eliminate such inequalities than expected from naive parameter counting [5].

It is natural to ask whether it is possible to control other pairs of coefficients or even larger 
groups of coefficients. We discuss a simple example before proceeding with the analysis of 
this problem. 


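Example 4. Consider a soft-plus unit with two binary inputs. Write $f\colon s \mapsto \log(1 + \exp(s))$
for the soft-plus non-linearity and $f_0 = f(c)$, $f_1 = f(w_1 + c)$, $f_2 = f(w_2 + c)$, $f_{12} =
f(w_1 + w_2 + c)$ for the values of the soft-plus unit on $\{0,1\}^2$. From Equation (8) it is easy
to see that

$$\begin{aligned}
K_\emptyset &= f_0 > 0, \\
K_{\{1\}} &= f_1 - f_0 > -K_\emptyset, \\
K_{\{2\}} &= f_2 - f_0 > -K_\emptyset.
\end{aligned}$$

Now let us investigate the quadratic coefficient $K_{\{1,2\}} = f_{12} - f_1 - f_2 + f_0$. Using the
convexity of $f$ we find

$$\begin{aligned}
0 \le K_{\{1,2\}}, & \quad \text{if } K_{\{1\}}, K_{\{2\}} \ge 0, \\
0 \le K_{\{1,2\}} \le -K_{\{1\}}, -K_{\{2\}}, & \quad \text{if } -K_{\{1\}}, -K_{\{2\}} \ge 0, \\
-K_{\{1\}} \le K_{\{1,2\}} \le 0, & \quad \text{if } K_{\{1\}}, -K_{\{2\}} \ge 0, \\
-K_{\{2\}} \le K_{\{1,2\}} \le 0, & \quad \text{if } -K_{\{1\}}, K_{\{2\}} \ge 0.
\end{aligned}$$

Hence the computable polynomials have coefficient triples $(K_{\{1\}}, K_{\{2\}}, K_{\{1,2\}})$ enclosed in
a polyhedral region of $\mathbb{R}^3$ as depicted in Figure 3. However, any pair $(K_{\{1\}}, K_{\{2\}}) \in \mathbb{R}^2$ is
possible (for $K_\emptyset$ large enough).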
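The sign constraints on the quadratic coefficient of a two-input soft-plus unit can be checked empirically. The sketch below samples random parameters $(w_1, w_2, c)$ and tests the four cases; the sampling range, the seed, and the tolerance are arbitrary choices made for this illustration. In the case where both linear coefficients are non-positive, the code also checks the upper bound $\min(-K_{\{1\}}, -K_{\{2\}})$, which follows from the monotonicity of $f$.

```python
import math
import random

def f(s):
    return math.log1p(math.exp(s))

def coefficients(w1, w2, c):
    """Return (K_empty, K_1, K_2, K_12) of a two-input soft-plus unit."""
    f0, f1, f2, f12 = f(c), f(w1 + c), f(w2 + c), f(w1 + w2 + c)
    return f0, f1 - f0, f2 - f0, f12 - f1 - f2 + f0

def satisfies_constraints(K0, K1, K2, K12, tol=1e-9):
    if K1 >= 0 and K2 >= 0:
        return K12 >= -tol
    if K1 <= 0 and K2 <= 0:
        return -tol <= K12 <= min(-K1, -K2) + tol
    if K1 >= 0 and K2 <= 0:
        return -K1 - tol <= K12 <= tol
    return -K2 - tol <= K12 <= tol   # remaining case: K1 <= 0 <= K2

random.seed(0)
samples = [tuple(random.uniform(-5, 5) for _ in range(3)) for _ in range(10000)]
all_ok = all(satisfies_constraints(*coefficients(*s)) for s in samples)
print(all_ok)
```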


The next lemma shows that a soft-plus unit can jointly model certain tuples of polynomial
coefficients corresponding to $v - k + 1$ monomials of degree $k$. We call star tuple a set of
the form $\{B \cup \{j\} : j \in B'\}$, where $B, B' \subseteq V$ satisfy $B \cap B' = \emptyset$. Each element of the star
tuple covers the set $B$. In the Hasse diagram of the power set $2^V$, the sets $B \cup \{j\}$, $j \in B'$,
are the leaves of a star with root $B$.


Lemma 5. Consider any $B, B' \subseteq V$ with $B \cap B' = \emptyset$. Let $w_i = 0$ for $i \notin B \cup B'$. Then,
for any $J_{B \cup \{j\}} \in \mathbb{R}$, $j \in B'$, and $\epsilon > 0$, there is a choice of $w_{B \cup B'} \in \mathbb{R}^{B \cup B'}$ and $c \in \mathbb{R}$ such
that $|K_{B \cup \{j\}} - J_{B \cup \{j\}}| < \epsilon$ for all $j \in B'$, and $|K_C| < \epsilon$ for all $C \ne B, B \cup \{j\}$, $j \in B'$.

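Proof of Lemma 5. Since $w_i = 0$ for $i \notin B \cup B'$, we have that $K_C = 0$ for all $C \not\subseteq B \cup B'$.
We choose $c = -(|B| - \tfrac{1}{2}) w$, $w_i = w$ for all $i \in B$, and $w_j = J_{B \cup \{j\}}$ for $j \in B'$. Choosing
$w \gg \sum_{j \in B'} |w_j|$ yields $f(\sum_{i \in C} w_i + c) \approx 0$ for all $C \not\supseteq B$. In this case,

$$K_C \approx 0, \quad \text{for all } B \not\subseteq C \subseteq B \cup B'.$$

Furthermore, for all $j \in B'$ we have

$$\begin{aligned}
K_{B \cup \{j\}} &\approx f\Big(\sum_{i \in B} w_i + w_j + c\Big) - f\Big(\sum_{i \in B} w_i + c\Big) \\
&= f(J_{B \cup \{j\}} + \tfrac{1}{2} w) - f(\tfrac{1}{2} w) \\
&\approx (J_{B \cup \{j\}} + \tfrac{1}{2} w) - (\tfrac{1}{2} w) = J_{B \cup \{j\}}.
\end{aligned}$$

Similarly, $K_{B \cup C} \approx 0$ for all $C \subseteq B'$ with $|C| \ge 2$. Note that $K_B \approx \tfrac{1}{2} w$. □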
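The construction in the proof can be tried out numerically. The following sketch uses a hypothetical instance (root $B = \{0, 1\}$, leaves indexed by $B' = \{2, 3\}$, targets 0.8 and -1.1, and large weight $w = 40$) and computes all polynomial coefficients of the resulting soft-plus unit. These values are illustrative assumptions only.

```python
import math
from itertools import combinations

def softplus(s):
    return math.log1p(math.exp(s))

def K(B, w, c):
    """Coefficient K_B(w, c) of a soft-plus unit; w maps each index i to w_i."""
    B = sorted(B)
    return sum((-1) ** (len(B) - r) * softplus(sum(w[i] for i in C) + c)
               for r in range(len(B) + 1) for C in combinations(B, r))

B, B_prime = (0, 1), (2, 3)
targets = {2: 0.8, 3: -1.1}      # desired coefficients J_{B ∪ {j}} for j in B'
big = 40.0                       # the large weight "w" from the proof
w = {i: big for i in B}
w.update(targets)                # w_j = J_{B ∪ {j}} for j in B'
c = -(len(B) - 0.5) * big

V = B + B_prime
coeffs = {C: K(C, w, c) for r in range(len(V) + 1) for C in combinations(V, r)}
for C, value in coeffs.items():
    print(C, round(value, 6))
```

With these choices the coefficients of the two leaves are close to the targets, the coefficient of the root $B$ is close to $w/2 = 20$, and all remaining coefficients are close to zero, with errors of the order of $e^{-w/2}$.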
Figure 3: Illustration of Example 4. Depicted is a region of $\mathbb{R}^3$, clipped to $[-1, 1]^3$, which
contains the coefficient triples $(K_{\{1\}}, K_{\{2\}}, K_{\{1,2\}}) \in \mathbb{R}^3$ of the polynomials computable by
a soft-plus unit with two binary inputs. This region consists of 4 solid convex cones.


The intuition behind Lemma 5 is simple. When $\sum_{i \in B} w_i + c$ is large and positive, the values $w^\top x + c$,
for $x$ with $x_i = 1$, $i \in B$, fall in a region where the soft-plus function is nearly linear. In
turn, the soft-plus unit is nearly a linear function of $x_j$, $j \in B'$, with coefficients $w_j$, $j \in B'$.

Remark 6. Closely related to soft-plus units are rectified linear units, which compute
functions of the form

$$\varphi\colon \{0,1\}^V \to \mathbb{R}; \quad x \mapsto \max\{0, w^\top x + c\}.$$

In this case the non-linearity is $s \mapsto \max\{0, s\}$. This reflects precisely the zero/linear behavior of
the soft-plus activation for large negative or positive values of $s$. Our polynomial descriptions
are based on this behavior and hence they apply both to soft-plus and rectified linear units.

We close this section with a brief discussion of dependencies among coefficients. The
next proposition gives a perspective on the possible values of the coefficient $K_B$, depending
on $w_m$, once $w_{B \setminus \{m\}}$ and $c$ have been fixed.

Proposition 7. Let $(B, B')$ be an edge pair with $B' = B \setminus \{m\}$ and let $J_B \in \mathbb{R}$. For fixed
$w_{B'} \in \mathbb{R}^{B'}$ and $c \in \mathbb{R}$, there is some $w_m \in \mathbb{R}$ such that $K_B = J_B$ if and only if a certain
degree-$2^{|B|-1}$ polynomial in one real variable has a positive root.

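Proof of Proposition 7. Observe that

$$K_B(w, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c).$$

Hence $K_B = J_B$ if and only if $K_{B'}(w_{B'}, c + w_m) = K_{B'}(w_{B'}, c) + J_B =: r$. We use the
abbreviation $\omega = e^{w}$ (that is, $\omega_i = e^{w_i}$ and $\omega_c = e^{c}$), which implies positivity. We have

$$\begin{aligned}
K_{B'}(w_{B'}, c + w_m) &= \sum_{C \subseteq B'} (-1)^{|B' \setminus C|} \log\Big(1 + \exp\Big(\sum_{i \in C} w_i + c + w_m\Big)\Big) \\
&= \log\Big( \prod_{C \subseteq B'} \Big(1 + \omega_m \omega_c \prod_{i \in C} \omega_i\Big)^{(-1)^{|B' \setminus C|}} \Big).
\end{aligned}$$

Now, $K_{B'}(w_{B'}, c + w_m) = r$ if and only if

$$\prod_{C \subseteq B'} \Big(1 + \omega_m \omega_c \prod_{i \in C} \omega_i\Big)^{(-1)^{|B' \setminus C|}} = e^r,$$

or, equivalently,

$$\prod_{\substack{C \subseteq B' : \\ |B' \setminus C| \text{ even}}} \Big(1 + \omega_m \omega_c \prod_{i \in C} \omega_i\Big) - e^r \prod_{\substack{C \subseteq B' : \\ |B' \setminus C| \text{ odd}}} \Big(1 + \omega_m \omega_c \prod_{i \in C} \omega_i\Big) = 0.$$

This is a polynomial of degree at most $2^{|B|-1}$ in $\omega_m = e^{w_m}$. □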
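The recursion at the start of this proof, which expresses $K_B$ as a difference of two values of $K_{B'}$ with shifted bias, can be verified numerically. The weights below are arbitrary illustrative values and the helper `K` is a direct transcription of the coefficient formula of a soft-plus unit.

```python
import math
from itertools import combinations

def softplus(s):
    return math.log1p(math.exp(s))

def K(B, w, c):
    """Coefficient K_B(w, c) of a soft-plus unit; w maps each index i to w_i."""
    B = sorted(B)
    return sum((-1) ** (len(B) - r) * softplus(sum(w[i] for i in C) + c)
               for r in range(len(B) + 1) for C in combinations(B, r))

w = {1: 0.9, 2: -0.4, 3: 1.7}    # arbitrary illustrative weights
c = -0.3
B, m = (1, 2, 3), 3
B_prime = tuple(i for i in B if i != m)

lhs = K(B, w, c)
rhs = K(B_prime, w, c + w[m]) - K(B_prime, w, c)
print(lhs, rhs)
```

The two printed values should agree up to rounding, since the subsets of $B$ split into those containing $m$ (which shift the bias by $w_m$) and those not containing $m$ (which enter with the opposite sign).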

This description implies various kinds of constraints. For example, by Descartes’ rule of 
signs, a polynomial can only have positive roots if the sequence of polynomial coefficients, 
ordered by degree, has sign changes. 

4 Conditionally Independent Hidden Variables 

In the case of a bipartite graph between $V$ and $H$ with all variables binary, the hierarchical
model (or rather its visible marginal) is a restricted Boltzmann machine, denoted $\mathrm{RBM}_{V,H}$.
This model is illustrated in Figure 4. The free energy takes the form

$$F(x) = \sum_{i \in V} b_i x_i + \sum_{j \in H} \log\Big(1 + \exp\Big(\sum_{i \in V} w_{ji} x_i + c_j\Big)\Big).$$

This is the sum of an arbitrary degree-one polynomial, with coefficients $b_i \in \mathbb{R}$, $i \in V$,
and $h = |H|$ independent soft-plus units, with parameters $w_{ji} \in \mathbb{R}$, $j \in H$, $i \in V$, and
$c_j \in \mathbb{R}$, $j \in H$. The free energy contributed by the hidden variables can be thought of as a
feedforward network with soft-plus computational units.
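To make the factorization concrete, the sketch below compares the closed-form free energy of a small RBM with a brute-force evaluation of $\log \sum_y \exp(g(x, y))$ over all hidden states. The parameter values and function names are hypothetical and serve only as an illustration.

```python
import math
from itertools import product

def free_energy(x, b, W, c):
    """F(x) = sum_i b_i x_i + sum_j log(1 + exp(sum_i W[j][i] x_i + c_j))."""
    linear = sum(bi * xi for bi, xi in zip(b, x))
    hidden = sum(math.log1p(math.exp(sum(wji * xi for wji, xi in zip(Wj, x)) + cj))
                 for Wj, cj in zip(W, c))
    return linear + hidden

def free_energy_brute_force(x, b, W, c):
    """log sum_y exp(g(x, y)) with g(x, y) = b.x + sum_j y_j (W_j.x + c_j), y binary."""
    total = 0.0
    for y in product([0, 1], repeat=len(c)):
        g = sum(bi * xi for bi, xi in zip(b, x))
        g += sum(yj * (sum(wji * xi for wji, xi in zip(Wj, x)) + cj)
                 for yj, Wj, cj in zip(y, W, c))
        total += math.exp(g)
    return math.log(total)

b = [0.2, -0.4, 0.1]                        # visible biases (illustrative)
W = [[1.0, -2.0, 0.5], [0.3, 0.3, -1.5]]    # one row of weights per hidden unit
c = [-0.7, 0.9]                             # hidden biases
gap = max(abs(free_energy(x, b, W, c) - free_energy_brute_force(x, b, W, c))
          for x in product([0, 1], repeat=len(b)))
print(gap)
```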

We can use each soft-plus unit to model a group of coefficients of any given polynomial,
starting at the highest degrees. Using the results from Section 3, we arrive at the following
representation result:

Theorem 8. Every distribution from a hierarchical model $\mathcal{E}_S$ on $\{0,1\}^V$ can be
approximated arbitrarily well by distributions from $\mathrm{RBM}_{V,H}$ whenever there exist $h$ sets $\mathcal{B}_1, \ldots, \mathcal{B}_h \subseteq
2^V$ which cover $\{\Lambda \in S : |\Lambda| \ge 2\}$ in reverse inclusion order, where each $\mathcal{B}_j$ is a star tuple
or an edge pair of sets of cardinality at least 3.

Figure 4: A restricted Boltzmann machine. The free energy contributed by the hidden units
is a sum of independent soft-plus units.


Proof of Theorem 8. We need to express the possible energy functions of the hierarchical
model as sums of independent soft-plus units plus linear terms. This problem can be reduced
to covering the appearing monomials of degree two or more by groups of coefficients that
can be jointly controlled by soft-plus units. In view of Lemmas 2 and 5, edge pairs with
sets of cardinality 3 or more and star tuples can be jointly controlled. We start with the
highest degrees and cover monomials downwards, because setting the coefficients of a given
group may produce uncontrolled values for the coefficients of smaller monomials. Since $S$ is
a simplicial complex, we only need to cover the elements of $S$. □

Finding a minimal covering is in general a hard combinatorial problem. In the following
we derive upper bounds for the $k$-interaction model, which is the hierarchical model $\mathcal{E}_S$ with
$S = S_k := \{\Lambda \subseteq V : |\Lambda| \le k\}$. We will focus on star tuples and consider individual coverings
of the layers $\binom{V}{j} = \{\Lambda \subseteq V : |\Lambda| = j\}$. Let $v = |V|$. Denote by $D(v, j)$ the smallest number
of star tuples that cover $\binom{V}{j}$. We use the following notion from the theory of combinatorial
designs (see [1] for an overview on that subject). For integers $v \ge k \ge r$ denote by $C(v, k, r)$ the
smallest possible number of elements of $\binom{V}{k}$ such that every element from $\binom{V}{r}$ is contained
in at least one of them.

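Lemma 9. For $0 < j \le v$, the minimal number of star tuples that cover $\binom{V}{j}$ is $D(v, j) =
C(v, v - j + 1, v - j)$. Inserting known results for $C(v, t + 1, t)$ we obtain the exact values

$$\begin{aligned}
D(v, 1) &= 1, \\
D(v, 2) &= v - 1, \\
D(v, 3) &= \Big\lfloor \frac{(v-1)^2}{4} \Big\rfloor, \\
D(v, v - 3) &= \Big\lceil \frac{v}{4} \Big\lceil \frac{v-1}{3} \Big\lceil \frac{v-2}{2} \Big\rceil \Big\rceil \Big\rceil \quad (v \not\equiv 7 \bmod 12), \\
D(v, v - 2) &= \Big\lceil \frac{v}{3} \Big\lceil \frac{v-1}{2} \Big\rceil \Big\rceil, \\
D(v, v - 1) &= \Big\lceil \frac{v}{2} \Big\rceil, \\
D(v, v) &= 1,
\end{aligned}$$

and the general bound

$$D(v, j) \le \binom{v}{j} \frac{1 + \log(v - j + 1)}{v - j + 1}, \quad 0 < j \le v.$$

Furthermore, we have the simple bound $D(v, j) \le \binom{v-1}{j-1}$, $0 < j \le v$.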
Proof of Lemma 9. A star tuple covering of $\binom{V}{j}$ is given by a collection $B_1, \ldots, B_n$ of
elements of $\binom{V}{j-1}$ such that every element of $\binom{V}{j}$ contains at least one of the $B_i$. Passing to
complements in $V$, the minimal
possible number of elements in such a collection is precisely $D(v, j) = C(v, v - j + 1, v - j)$.
The equalities follow from corresponding equalities for $C(v, v - j + 1, v - j)$ by several
authors. The inequality follows from a result by Erdős and Spencer
showing that $C(v, k, r) \le \binom{v}{r} / \binom{k}{r} \big(1 + \log \binom{k}{r}\big)$. The simple bound results from the fact
that, for any fixed $i \in V$, each set $B$ from $\binom{V}{j}$ contains a set $B'$ from $\binom{V \setminus \{i\}}{j-1}$.
□
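For very small $v$ the star tuple covering numbers can be found by exhaustive search and compared with the two upper bounds. The sketch below is a brute-force illustration only (it is exponential in the number of roots, so it is limited to tiny cases); the function names are hypothetical, and rounding the general bound down is valid because $D(v, j)$ is an integer.

```python
from itertools import combinations
from math import comb, floor, log

def star_cover_number(v, j):
    """Brute-force D(v, j): fewest (j-1)-subsets (star roots) such that
    every j-subset of range(v) contains at least one of them."""
    roots = [frozenset(r) for r in combinations(range(v), j - 1)]
    targets = [frozenset(t) for t in combinations(range(v), j)]
    for size in range(1, len(roots) + 1):
        for chosen in combinations(roots, size):
            if all(any(r <= t for r in chosen) for t in targets):
                return size

def simple_bound(v, j):
    return comb(v - 1, j - 1)

def general_bound(v, j):
    return floor(comb(v, j) * (1 + log(v - j + 1)) / (v - j + 1))

results = {(v, j): star_cover_number(v, j) for v, j in [(4, 2), (4, 3), (5, 3), (5, 4)]}
for (v, j), d in results.items():
    print(v, j, d, simple_bound(v, j), general_bound(v, j))
```

In these small cases the exhaustive search gives $D(4,2) = 3$, $D(4,3) = 2$, $D(5,3) = 4$ and $D(5,4) = 3$, each of which is at most the simple bound and the general bound.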


Remark 10. Lemma 9 presents widely applicable bounds on the cardinality of star tuple
coverings, which are naturally not always tight. For $v < 28$, better individual bounds on
$C(v, t + 1, t)$ can be found in [13, Table III]. See also [14] for a list of known exact values.
In another direction, asymptotically optimal bounds on $C(v, k, r)$ for fixed $k$ and $r$ are available.


Lemma 9 allows us to formulate the following more explicit version of Theorem 8:


Theorem 11. Let $1 < k \le v$. Every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on
$\{0,1\}^V$ can be approximated arbitrarily well by distributions from $\mathrm{RBM}_{V,H}$ whenever $h$
surpasses or equals $U(v, k) = \sum_{j=2}^{k} D(v, j)$, which is bounded above as indicated in Lemma 9.
This is the case, in particular, whenever $h \ge \sum_{j=2}^{k} \binom{v-1}{j-1}$ or $h \ge \frac{\log(v-1)+1}{v+1} \sum_{j=2}^{k} \binom{v+1}{j}$.


Proof of Theorem 11. This follows directly from Theorem 8 and Lemma 9. For the last
statement we use the simple bound from the lemma, by which $D(v, j) \le \binom{v-1}{j-1}$, and the
general bound, by which $D(v, j) \le \frac{\log(v-1)+1}{v+1} \binom{v+1}{j}$ for $j \ge 2$. □


In order to provide a numerical sense of Theorem 11 we give upper bounds on $U(v, k)$,
$2 \le k \le v \le 14$, in Table 1. For convenience we also provide an Octave [5] script for
computing such bounds in http://personal-homepages.mis.mpg.de/montufar/starcover.m

In the special case $k = v$, the $k$-interaction model $\mathcal{E}_{S_k}$ is the full interaction model
and contains all (strictly positive) probability distributions on $\{0,1\}^V$. Hence Theorem 11
entails the following universal approximation result:
Corollary 12. Every distribution on $\{0,1\}^V$ can be approximated arbitrarily well by
distributions from $\mathrm{RBM}_{V,H}$ whenever $h$ surpasses or equals $U(v, v)$, which is bounded above
as indicated in Lemma 9. This is the case, in particular, whenever $h \ge 2^{v-1} - 1$ or
$h \ge \frac{2(\log(v-1)+1)}{v+1} \big(2^v - (v + 1) - 1\big) + 1$.
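To get a feeling for the two explicit sufficient conditions, $h \ge 2^{v-1} - 1$ and $h \ge \frac{2(\log(v-1)+1)}{v+1}(2^v - (v+1) - 1) + 1$, and for the parameter-counting lower bound $\lceil 2^v/(v+1) \rceil - 1$, one can evaluate them for a few values of $v$. The function names below are hypothetical, and the logarithmic bound is rounded up to the next integer because $h$ counts hidden units.

```python
from math import ceil, log

def previous_bound(v):
    """The bound 2^(v-1) - 1."""
    return 2 ** (v - 1) - 1

def log_bound(v):
    """Smallest integer h with h >= 2 (log(v-1) + 1) / (v+1) * (2^v - (v+1) - 1) + 1."""
    return ceil(2 * (log(v - 1) + 1) / (v + 1) * (2 ** v - (v + 1) - 1) + 1)

def parameter_counting_lower_bound(v):
    """ceil(2^v / (v+1)) - 1, computed in exact integer arithmetic."""
    return -(-2 ** v // (v + 1)) - 1

for v in (5, 10, 20, 40):
    print(v, log_bound(v), previous_bound(v), parameter_counting_lower_bound(v))
```

For small $v$ the logarithmic expression is larger than $2^{v-1} - 1$, while for large $v$ it is smaller by a factor of order $\log(v)/v$; the sharper values of $U(v, v)$ obtained from the individual bounds on $D(v, j)$ are smaller still.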

Corollary 12 provides a significant and unexpected improvement over the best previously
known upper bound $2^{v-1} - 1$ from [10]. Whether the upper bound $2^{v-1} - 1$ was optimal
or not had remained an open problem in [10] and several succeeding papers. In Table 2 we
give upper bounds on $U(v, v)$, $2 \le v \le 40$, and compare these with the previous result.


Remark 13. In general an RBM can represent many more distributions than just the 
interaction models described above. For several small examples discussed further below, 
our bounds for the representation of interaction models are tight. However, Theorem 11 is
based on upper bounds on a specific type of coverings and we suspect that it can be further
improved, at least in some special cases, even if not reaching the hard lower bounds coming 
from parameter counting. 
 k \ v   5        6        7        8        9         10        11         12         13          14
 5       12_15    20       34       53^80    84^172    124^343   182^570    251^908    453^1385    1002^2068
 6       -        21_31    38       64^84    109^184   175^373   282^742    427^1276   750^2107    1473^3389
 7       -        -        39_63    68^85    121^189   205^390   348^789    559^1534   1014^2705   1944^4652
 8       -        -        -        69_127   126^190   222^395   395^808    672^1591   1259^3078   2452^5583
 9       -        -        -        -        127_255   227^396   414^814    729^1615   1416^3156   2823^6105
 10      -        -        -        -        -         228_511   420^815    753^1621   1494^3182   3053^6196
 11      -        -        -        -        -         -         421_1023   759^1622   1520^3189   3144^6229
 12      -        -        -        -        -         -         -          760_2047   1527^3190   3177^6236
 13      -        -        -        -        -         -         -          -          1528_4095   3184^6237
 14      -        -        -        -        -         -         -          -          -           3185_8191

Table 1: Upper bounds on the minimal number of hidden units for which $\mathrm{RBM}_{V,H}$ can
approximate every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on $\{0,1\}^V$ arbitrarily well,
following from Theorem 11 for $2 \le k \le v \le 14$. Shown are upper bounds on $U(v, k) =
\sum_{j=2}^{k} D(v, j)$ evaluated using Lemma 9 and some individual bounds on $D(v, j) = C(v, v -
j + 1, v - j)$ from [13, Table III]. Superscripts (written a^b) indicate values obtained using only Lemma 9.
Subscripts (written a_b) indicate the previous RBM universal approximation bound $2^{v-1} - 1$ from [10].
Entries with $v \le 9$ or $k \le 3$ are exact values of $U(v, k)$.
 v    U(v,v) <=          2^(v-1) - 1        ceil(2^v/(v+1)) - 1
 2    1                  1                  1
 3    3                  3                  1
 4    6                  7                  3
 5    12                 15                 5
 6    21                 31                 9
 7    39                 63                 15
 8    69                 127                28
 9    127                255                51
 10   228                511                93
 11   421                1023               170
 12   760                2047               315
 13   1528               4095               585
 14   3185               8191               1092
 15   6642               16,383             2047
 16   14,269             32,767             3855
 17   30,352             65,535             7281
 18   63,431             131,071            13,797
 19   132,195            262,143            26,214
 20   272,160            524,287            49,932
 21   553,195            1,048,575          95,325
 22   1,115,207          2,097,151          182,361
 23   2,227,484          4,194,303          349,525
 24   4,427,830          8,388,607          671,088
 25   8,760,826          16,777,215         1,290,555
 26   17,265,199         33,554,431         2,485,513
 27   33,951,316         67,108,863         4,793,490
 28   66,656,315         134,217,727        9,256,395
 29   132,084,407        268,435,455        17,895,697
 30   257,962,181        536,870,911        34,636,833
 31   504,141,876        1,073,741,823      67,108,863
 32   985,875,453        2,147,483,647      130,150,524
 33   1,929,093,753      4,294,967,295      252,645,135
 34   3,776,867,237      8,589,934,591      490,853,405
 35   7,398,516,744      17,179,869,183     954,437,176
 36   14,500,416,431     34,359,738,367     1,857,283,155
 37   28,433,369,622     68,719,476,735     3,616,814,565
 38   55,779,952,400     137,438,953,471    7,048,151,460
 39   109,476,401,847    274,877,906,943    13,743,895,347
 40   214,954,581,277    549,755,813,887    26,817,356,775

Table 2: Upper bounds on the minimal number of hidden units for which $\mathrm{RBM}_{V,H}$ can
approximate every distribution on $\{0,1\}^V$ arbitrarily well, for $2 \le v \le 40$. The first column gives
upper bounds following from Corollary 12. Shown are upper bounds on $U(v, v) = \sum_{j=2}^{v} D(v, j)$
evaluated using Lemma 9 and some individual bounds on $D(v, j) = C(v, v - j + 1, v - j)$
from [13, Table III]. The second column gives the previous upper bound $2^{v-1} - 1$ from [10].
The third column gives the hard lower bound $\lceil 2^v/(v+1) \rceil - 1$ that results from parameter
counting, demanding that the model $\mathrm{RBM}_{V,H}$ has at least $(h + 1)(v + 1) - 1 \ge 2^v - 1$
parameters.
Besides RBMs we can also consider models that include interactions among the
visible variables other than biases. In this case we only need to cover the interaction sets
from the simplicial complex $S$ that are not already included in the simplicial complex $T$. In
Theorem 8 one just replaces $\{\Lambda \in S : |\Lambda| \ge 2\}$ by $S \setminus T$. We note the following special case:


Corollary 14. Every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on $\{0,1\}^V$ can be
approximated arbitrarily well by the visible marginals of a pairwise interaction model with
$h = \sum_{j=3}^{k} D(v, j)$ hidden binary variables. The latter is bounded above as indicated in
Lemma 9. In particular, every distribution on $\{0,1\}^V$ can be approximated arbitrarily well
by the visible marginals of a pairwise interaction model with $h = 2^{v-1} - (v - 1) - 1$ or
$h = \frac{2(\log(v-2)+1)}{v+1} \big(2^v - (v + 1) - 1 - \frac{(v+1)v}{4}\big) + 1$ hidden binary variables.


Corollary 14 improves a previous result by Younes [17], which showed that a pairwise
interaction model with $h = 2^v - \binom{v}{2} - v - 1$ hidden binary variables can approximate every
distribution on $\{0,1\}^V$ arbitrarily well.

We close this section with a few small examples illustrating our results. 


Example 15. The model RBM 31 is the same as the two-mixture of product distributions 
of 3 binary variables and is also known as the tripod tree model. It has 7 parameters and 
the same dimension. What is the largest hierarchical model contained in the closure of this 
model? The closure of a model M is the set of all probability distributions that can be 
approximated arbitrarily well by probability distributions from A4. 

The closure of $\mathrm{RBM}_{3,1}$ contains all 3 hierarchical models on $\{0,1\}^3$ with two pairwise
interactions. For example, it contains the model $\mathcal{E}_S$ with $S = \{\{1,2\},\{1,3\},\{1\},\{2\},\{3\}\}$.
Indeed, two quadratic coefficients can be jointly modeled by one soft-plus unit (Lemma [5]),
and the linear coefficients can be modeled with the biases of the visible variables. In particular, the closure
of $\mathrm{RBM}_{3,1}$ also contains the 3 hierarchical models with a single pairwise interaction.

It does not contain the hierarchical model with 3 pairwise interactions, $\mathcal{E}_S$ with
$S = S_2 = \{\{1,2\},\{1,3\},\{2,3\},\{1\},\{2\},\{3\}\}$, which is known as the no-three-way interaction model.
One way of proving this is by comparing the possible support sets of the two models, as
proposed in [8]. The support set of a product distribution is a cylinder set. The support set
of a mixture of two product distributions is a union of two cylinder sets. On the other hand,
the possible support sets of a hierarchical model correspond to the faces of its marginal
polytope, $\operatorname{conv}\{(\prod_{i \in A} x_i)_{A \in S} : x \in \mathcal{X}\} \subseteq \mathbb{R}^S$. The marginal polytope of the no-three-way
interaction model is the cyclic polytope $C(N, d)$ with $N = 8$ vertices and dimension $d = 6$
(see, e.g., 0 Lemma 18]). This is a neighborly polytope, meaning that any $d/2 = 3$
or fewer vertices form a face. In turn, every subset of $\{0,1\}^3$ of cardinality $d/2 = 3$ is the
support set of a distribution in the closure of the no-three-way interaction model.¹ Since the
set $\{(100), (010), (001)\}$ is not a union of two cylinder sets, the closure of $\mathrm{RBM}_{3,1}$ does not
contain the no-three-way interaction model.
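The last step of this argument can be verified by brute force. The sketch below assumes that a cylinder set in $\{0,1\}^n$ is a set obtained by fixing some coordinates and leaving the others free (a subcube); the helper names are ours.

```python
from itertools import product

def cylinder_sets(n):
    """All cylinder sets of {0,1}^n: each coordinate is fixed to 0, fixed to 1, or free."""
    cubes = []
    for pattern in product((0, 1, None), repeat=n):
        choices = [(0, 1) if p is None else (p,) for p in pattern]
        cubes.append(frozenset(product(*choices)))
    return cubes

def is_union_of_two_cylinders(target, n):
    """Check whether target equals the union of two (possibly equal) cylinder sets."""
    target = frozenset(target)
    inside = [cube for cube in cylinder_sets(n) if cube <= target]
    return any(a | b == target for a in inside for b in inside)

unit_vectors = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
print(is_union_of_two_cylinders(unit_vectors, 3))
```

Any two of the three vectors differ in two entries, so the only cylinder sets contained in the set are single points, and two of them cannot cover three points.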


Example 16. The closure of $\mathrm{RBM}_{3,2}$ contains the no-three-way interaction model. Two of
the quadratic coefficients can be jointly modeled with one hidden unit and the third with
the second hidden unit (Lemma [5]).

It does not contain the full interaction model. Following the ideas explained in the previous
example, this can be shown by analyzing the possible support sets of the distributions
in the closure of $\mathrm{RBM}_{3,2}$. For details on this we refer the reader to [12].

1 More generally, in [6] it is shown that if $S \supseteq \{A \subseteq V : |A| \leq k\}$, then the marginal polytope of $\mathcal{E}_S$ is
$(2^k - 1)$-neighborly, meaning that any $2^k - 1$ or fewer of its vertices define a face.

Example 17. The model $\mathrm{RBM}_{3,3}$ is a universal approximator. This follows immediately
from the universal approximation bound $2^{v-1} - 1$ from D2H. This observation can be recovered
from our results as follows. The cubic coefficient can be modeled with one hidden unit
(Lemma [l]). Two quadratic coefficients can be jointly modeled with one hidden unit and
the third with another hidden unit (Lemma [5]).

Example 18. The model $\mathrm{RBM}_{4,6}$ is a universal approximator. The quartic coefficient can
be modeled with one hidden unit. The 4 cubic coefficients can be modeled with two hidden
units (Lemma [5]). The 6 quadratic coefficients can be grouped into 3 pairs with a shared
variable in each pair. These can be modeled with 3 hidden units (Lemma [5]), for a total of $1 + 2 + 3 = 6$ hidden units.
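The grouping of the 6 quadratic coefficients can be found by exhaustive search. The following sketch enumerates all ways of splitting the pairs $\{i,j\} \subseteq \{1,2,3,4\}$ into groups of two that share a variable; the function name is ours.

```python
from itertools import combinations

def shared_variable_pairings(n):
    """All partitions of the 2-subsets of {1..n} into pairs that share a variable."""
    edges = list(combinations(range(1, n + 1), 2))

    def extend(remaining):
        if not remaining:
            return [[]]
        first, rest = remaining[0], remaining[1:]
        found = []
        for partner in rest:
            if set(first) & set(partner):  # the two interaction sets overlap
                others = [e for e in rest if e != partner]
                for tail in extend(others):
                    found.append([(first, partner)] + tail)
        return found

    return extend(edges)

pairings = shared_variable_pairings(4)
print(len(pairings))
print(pairings[0])
```

Each pairing found this way uses 3 groups, matching the 3 hidden units assigned to the quadratic coefficients in the example.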


5 Conclusions and Outlook 

We studied the kinds of interactions that appear when marginalizing over a hidden variable
that is connected by pairwise interactions with all visible variables. We derived upper bounds on
the minimal number of hidden variables of a hierarchical model whose visible marginal distributions
can approximate any distribution from a given fully observable hierarchical model arbitrarily 
well. These results generalize and improve previous results on the representational power of 
RBMs from m and El- 

Many interesting questions remain open at this point. In particular, a full characterization of soft-plus
polynomials and of the necessary number of hidden variables is missing.

It would be interesting to look at non-binary hidden variables. This corresponds to 
analyzing the hierarchical models that can be represented by mixture models. In the case of 
conditionally independent binary hidden variables, the partial factorization leads to soft-plus 
units, whereas in the case of higher-valued hidden variables, it leads to shifted logarithms 
of denormalized mixtures. Similarly, it would be interesting to take a look at non-binary 
visible variables. In this case state vectors cannot be identified in a one-to-one manner 
with subsets of variables. This means that the correspondence between function values and 
polynomial coefficients is not as direct. 

Our analysis could also be extended to cover the representation of conditional probability 
distributions from hierarchical models in terms of conditional restricted Boltzmann machines 
and to refine the results on this problem reported in HU. 

Another interesting direction is the study of models where the hidden variables are not conditionally
independent given the visible variables, such as deep Boltzmann machines, which involve 
several layers of hidden variables. This case is more challenging, since the free energy does 
not decompose into independent terms. 
A Proofs


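Proof of Lemma Q Let $B' = B \setminus \{m\}$. The edge coefficients satisfy

$$K_{B'}(w_{B'}, c) = \sum_{C \subseteq B'} (-1)^{|B' \setminus C|} \log\Big(1 + \exp\Big(\sum_{i \in C} w_i + c\Big)\Big)$$

and

$$K_B(w_B, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c).$$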
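The recursive structure can be checked numerically. The sketch below assumes that the edge coefficient is the alternating sum of soft-plus values over subsets, as in the first identity; the function names and the weight values are illustrative.

```python
import math
from itertools import combinations

def softplus(t):
    # Numerically stable evaluation of log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def edge_coefficient(weights, c):
    """Alternating sum over all subsets C of the index set of `weights`."""
    n = len(weights)
    total = 0.0
    for r in range(n + 1):
        for C in combinations(range(n), r):
            total += (-1) ** (n - r) * softplus(sum(weights[i] for i in C) + c)
    return total

w_Bprime = [0.7, -1.3, 2.1]   # illustrative weights for B'
w_m, c = 0.4, -0.5            # illustrative weight of the extra element m and bias

lhs = edge_coefficient(w_Bprime + [w_m], c)
rhs = edge_coefficient(w_Bprime, c + w_m) - edge_coefficient(w_Bprime, c)
print(abs(lhs - rhs))
```

The identity holds because the subsets of $B$ split into those that contain $m$, which contribute the coefficient with bias $c + w_m$, and those that do not, which contribute the coefficient with bias $c$ and the opposite sign.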

Using this structure, we now proceed with the proof of the individual cases. 

The case $|B'| = 0$. We omit this simple exercise.

The case $|B'| = 1$. The if statement is as follows. The elements of the set $\{0,1\}^B$ are the
vertices of the $|B|$-dimensional unit cube. We call two vectors $x, x' \in \{0,1\}^B$ adjacent if
they differ in exactly one entry, in which case they are the vertices of an edge of the cube.

The weights $w_B$ and $c$ can be chosen such that the affine map $\{0,1\}^B \to \mathbb{R}$; $x_B \mapsto w_B^\top x_B + c$
maps any chosen pair of adjacent vectors to arbitrary values and all other
vectors to large negative values. The soft-plus function is monotonically increasing, taking
value zero at minus infinity and plus infinity at plus infinity. Hence, for any $s, s' \in \mathbb{R}_+$, one
finds weights $w$ and $c$ such that


$$\phi(x) \begin{cases} = s, & (x_{B'}, x_m) = (1, \ldots, 1, 1) \\ = s', & (x_{B'}, x_m) = (1, \ldots, 1, 0) \\ \approx 0, & \text{otherwise} \end{cases}$$

or, alternatively, such that

$$\phi(x) \begin{cases} = s, & (x_{B'}, x_m) = (1, \ldots, 1, 0, 1) \\ = s', & (x_{B'}, x_m) = (1, \ldots, 1, 0, 0) \\ \approx 0, & \text{otherwise.} \end{cases}$$

This leads to $K_B \approx (s - s')$ and $K_{B'} \approx s'$ or, alternatively, $K_B \approx -(s - s')$ and $K_{B'} \approx -s'$.
The approximation can be made arbitrarily precise.
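The first of the two constructions can be illustrated numerically for $|B'| = 1$. In the sketch below the offset `L`, the function names and the choice of weights are ours: the soft-plus unit takes the value $s$ at $(1,1)$, the value $s'$ at $(1,0)$, and values close to zero at the two remaining vertices.

```python
import math

def softplus(t):
    # Numerically stable evaluation of log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def softplus_inv(y):
    # Inverse of the soft-plus function on (0, inf): log(exp(y) - 1).
    return math.log(math.expm1(y))

def edge_pair(s, s_prime, L=40.0):
    """Return (K_B, K_B') for B' = {1}, B = {1, m}, with a large offset L (our choice)."""
    a, a_prime = softplus_inv(s), softplus_inv(s_prime)
    w1 = a_prime + L      # weight of the element of B'
    wm = a - a_prime      # weight of the element m
    c = -L                # bias pushing the other vertices to large negative inputs

    def phi(x1, xm):
        return softplus(w1 * x1 + wm * xm + c)

    K_Bprime = phi(1, 0) - phi(0, 0)
    K_B = phi(1, 1) - phi(1, 0) - phi(0, 1) + phi(0, 0)
    return K_B, K_Bprime

K_B, K_Bprime = edge_pair(2.0, 0.5)
print(K_B, K_Bprime)
```

Increasing `L` makes the values at the remaining vertices smaller, which is the sense in which the approximation can be made arbitrarily precise.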

The only if statement is as follows. Denote the soft-plus function by $f\colon \mathbb{R} \to \mathbb{R}_+$; $s \mapsto \log(1 + \exp(s))$.
Since $|B'| = 1$, $C \subseteq B'$ implies $C = B'$ or $C = \emptyset$. We have that
$K_{B'}(w_{B'}, c) = f(w_{B'} + c) - f(c)$ and $K_{B'}(w_{B'}, c + w_m) = f(w_{B'} + c + w_m) - f(c + w_m)$ are
either both positive or both negative, depending on the sign of $w_{B'}$. If both are positive,
then $K_B(w_B, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c) > -K_{B'}(w_{B'}, c)$, and similarly in the
case that both are negative.

The case $|B'| = 2$. The if statement follows from the previous case $|B'| = 1$. Indeed,
consider an edge pair $(C, C')$ with one more element than the edge pair $(B, B')$, such that
$B = C \setminus \{n\}$ and $B' = C' \setminus \{n\}$. Then, for any $w_B$ and $c$, choosing $w_n$ large enough
one obtains an arbitrarily accurate approximation $K_C((w_B, w_n), c - w_n) \approx K_B(w_B, c)$ and
$K_{C'}((w_{B'}, w_n), c - w_n) \approx K_{B'}(w_{B'}, c)$.
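The approximation can be made explicit by splitting the alternating sum according to whether the added index $n$ belongs to the summation set. The following short derivation is a sketch based on the alternating-sum form of the edge coefficients given at the start of this proof, with $f(s) = \log(1 + \exp(s))$ the soft-plus function and $D$ a summation index introduced here for illustration:

```latex
\begin{align*}
K_C\big((w_B, w_n),\, c - w_n\big)
 &= \sum_{D \subseteq B} (-1)^{|B \setminus D|} f\Big(\sum_{i \in D} w_i + c\Big)
  - \sum_{D \subseteq B} (-1)^{|B \setminus D|} f\Big(\sum_{i \in D} w_i + c - w_n\Big) \\
 &= K_B(w_B, c) - K_B(w_B, c - w_n).
\end{align*}
```

Since $f(t) \to 0$ as $t \to -\infty$, the second term vanishes as $w_n \to \infty$ with $w_B$ and $c$ fixed. The same computation applies to the pair $(C', B')$.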

For the only if statement we use a similar argument as before. We have
$K_{B'}(w_{B'}, c) = f(w_1 + w_2 + c) + f(c) - f(c + w_1) - f(c + w_2)$. By convexity of $f$, this is non-negative if and
only if either $w_1, w_2 \geq 0$ or $w_1, w_2 \leq 0$. In other words, this is non-negative if and only if
$w_1 \cdot w_2 \geq 0$. Under either of these conditions, $K_{B'}(w_{B'}, c + w_m)$ is also non-negative. Similarly,
$K_{B'}(w_{B'}, c)$ is non-positive if and only if $w_1 \cdot w_2 \leq 0$. In this case, $K_{B'}(w_{B'}, c + w_m)$
is also non-positive. Now the statement follows as in the case $|B'| = 1$.
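The sign rule for $|B'| = 2$ is easy to test on random inputs. The sketch below samples weights and biases uniformly from an interval of our choosing and checks that the coefficient never has the opposite sign of $w_1 \cdot w_2$, up to a small floating-point tolerance; the function names are ours.

```python
import math
import random

def softplus(t):
    # Numerically stable evaluation of log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def pair_coefficient(w1, w2, c):
    """Edge coefficient for a two-element set with weights w1, w2 and bias c."""
    return softplus(w1 + w2 + c) + softplus(c) - softplus(c + w1) - softplus(c + w2)

def sign_rule_holds(trials=10000, seed=0, tol=1e-12):
    """Check that the coefficient and w1 * w2 never have strictly opposite signs."""
    rng = random.Random(seed)
    for _ in range(trials):
        w1, w2, c = (rng.uniform(-5.0, 5.0) for _ in range(3))
        if pair_coefficient(w1, w2, c) * w1 * w2 < -tol:
            return False
    return True

print(sign_rule_holds())
```

Because the bias is sampled freely, the same check also covers the shifted coefficient with bias $c + w_m$.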

The case $|B'| = 3$. We need to show that any edge pair coefficients can be represented.
Consider first $J_{B'} > 0$. We choose weights of the form $w_{B'} = \omega \mathbf{1}_{B'}$, where $\omega \in \mathbb{R}$ and $\mathbf{1}_{B'}$ is
the vector of $|B'|$ ones. Then $K_{B'}(w_{B'}, c) = f(3\omega + c) - 3f(2\omega + c) + 3f(\omega + c) - f(c)$. We
can choose $\omega$ and $c$ such that $3\omega + c = f^{-1}(J_{B'})$ while $2\omega + c$, $\omega + c$, $c$ take very large negative
values. This yields $K_{B'} \approx J_{B'}$.
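This choice of weights can be tried out numerically. In the sketch below the target value `J` and the size of `omega` are illustrative choices of ours, and the bias is set so that $3\omega + c$ equals the soft-plus preimage of `J`.

```python
import math

def softplus(t):
    # Numerically stable evaluation of log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def softplus_inv(y):
    # Inverse of the soft-plus function on (0, inf): log(exp(y) - 1).
    return math.log(math.expm1(y))

def triple_coefficient(omega, c):
    """Edge coefficient for three equal weights omega and bias c."""
    return (softplus(3 * omega + c) - 3 * softplus(2 * omega + c)
            + 3 * softplus(omega + c) - softplus(c))

J = 1.7                            # illustrative positive target value
omega = 60.0                       # illustrative large weight
c = softplus_inv(J) - 3 * omega    # so that 3 * omega + c is the preimage of J

K = triple_coefficient(omega, c)
print(K, J)
```

With these values the three lower-order terms are evaluated at inputs below $-58$, so their contribution is negligible compared to `J`.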

Note that the derivative of the soft-plus function is the logistic function, i.e., $f'(s) = 1/(1+\exp(-s))$.
Choosing $\omega$ large enough from the beginning, the function $w_m \mapsto K_{B'}(w_{B'}, c + w_m)$
is monotonically increasing in the interval $w_m \in [0, \omega/2]$ and surpasses the value
fee. On the other hand, when w m is large enough, depending on oj and c, we have that 
2w + c + w m > y|(3w + c + w m ) and /(2w + c + w m ) > ^/(3w + c + w TO ). In this case 
/(3w + c + w m ) — 3/(2w + c-|-r<; rn ,) < —1(3 u! + c + w m ) < —\w. At the same time, u> + c + w m 
and c + w m are smaller than — and so f(u> + c + w m ) and /(c + w m ) are very small in 
absolute value. 

By the intermediate value theorem, depending on $w_m$, $K_{B'}(w_{B'}, c + w_m)$ takes any value in the
interval [— iw, |w], where u> is arbitrarily large. In turn, we can obtain I\ B = Kb'{w' b , c + 
w m ) - K B '(w B ',c) « J B for any J B G K. 

For $J_{B'} < 0$ the proof is analogous after label switching for one variable.

The case $|B'| > 3$. This follows from the previous case $|B'| = 3$ in the same way that the
if part of the case $|B'| = 2$ follows from the case $|B'| = 1$. □